Indigenous African languages are categorized as under-served in artificial intelligence and suffer from poor digital inclusion and poor access to information. The challenge is how to use machine learning and deep learning models when the necessary data are unavailable. Kencorpus is a Kenyan-language corpus that intends to bridge the gap on how to collect and store text and speech data sufficient to enable data-driven solutions such as machine translation, question answering in multilingual communities, and transcription. Kencorpus is a corpus (text and speech) of three languages predominantly spoken in Kenya: Swahili, Dholuo, and Luhya (the dialects Lumarachi, Lulogooli, and Lubukusu). The corpus intends to fill the gap in developing datasets that can be used for natural language processing and machine learning tasks in low-resource languages. Each of these languages contributed both text and speech data to the corpus. Data collection was done by researchers working with communities, schools, and partners (media, publishers). Kencorpus has a collection of 5,594 items: 4,442 texts (5.6 million words) and 1,152 speech files (177 hours). Based on these data, further datasets were also developed, such as POS-tagged sets for Dholuo and Luhya (50,000 and 93,000 words respectively), question-answer pairs from Swahili texts (7,537 QA pairs), and a translation of texts into Swahili (12,400 sentences). The datasets can be used for machine learning tasks such as text processing, annotation, and translation. The project also ran proof-of-concept systems for the QA task on text and for machine learning on speech, and the initial results confirm the usability of Kencorpus for the machine learning community. Kencorpus is the first such corpus for these low-resource languages and provides a foundation for learning from and sharing experience with similar works.
The need for Question Answering datasets in low-resource languages is the motivation of this research, leading to the development of the Kencorpus Swahili Question Answering Dataset, KenSwQuAD. This dataset is annotated from raw story texts in Swahili, a low-resource language predominantly spoken in Eastern Africa and in other parts of the world. Question Answering (QA) datasets are important for machine comprehension of natural language in tasks such as internet search and dialog systems. Machine learning systems need training data such as the gold-standard Question Answering set developed in this research. The research engaged annotators to formulate QA pairs from Swahili texts collected by the Kencorpus project, a Kenyan languages corpus. The project annotated 1,445 of the total 2,585 texts with at least 5 QA pairs each, resulting in a final dataset of 7,526 QA pairs. A quality assurance pass over 12.5% of the annotated texts confirmed that the QA pairs were all correctly annotated. A proof of concept applying the set to the QA task confirmed that the dataset is usable for such tasks. KenSwQuAD has also contributed to the resourcing of the Swahili language.
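As a concrete illustration of what a gold-standard extractive QA record of this kind typically looks like, here is a minimal Python sketch assuming a SQuAD-style layout. The field names, the toy Swahili story, and the validation helper are hypothetical; the abstract does not specify KenSwQuAD's actual schema.

```python
# A minimal sketch of a SQuAD-style QA record, assuming (hypothetically) that
# answers are extractive spans of the story text. The story, questions, and
# field names below are illustrative only, not the KenSwQuAD schema.
record = {
    "story_id": "swa_0001",  # hypothetical identifier
    "context": "Juma alienda sokoni kununua matunda asubuhi.",
    "qa_pairs": [
        {"question": "Juma alienda wapi?", "answer": "sokoni"},
        {"question": "Alienda kununua nini?", "answer": "matunda"},
    ],
}

def answers_are_extractive(rec: dict) -> bool:
    """Check that every answer occurs verbatim in the story context,
    a simple quality-assurance pass similar in spirit to the one described."""
    return all(qa["answer"] in rec["context"] for qa in rec["qa_pairs"])

print(answers_are_extractive(record))  # True for this toy record
```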
The visual dimension of cities has been a fundamental subject in urban studies, since the pioneering work of scholars such as Sitte, Lynch, Arnheim, and Jacobs. Several decades later, big data and artificial intelligence (AI) are revolutionizing how people move, sense, and interact with cities. This paper reviews the literature on the appearance and function of cities to illustrate how visual information has been used to understand them. A conceptual framework, Urban Visual Intelligence, is introduced to systematically elaborate on how new image data sources and AI techniques are reshaping the way researchers perceive and measure cities, enabling the study of the physical environment and its interactions with socioeconomic environments at various scales. The paper argues that these new approaches enable researchers to revisit the classic urban theories and themes, and potentially help cities create environments that are more in line with human behaviors and aspirations in the digital age.
Anomaly detection on time series data is increasingly common across various industrial domains that monitor metrics in order to prevent potential accidents and economic losses. However, a scarcity of labeled data and ambiguous definitions of anomalies can complicate these efforts. Recent unsupervised machine learning methods have made remarkable progress in tackling this problem using either single-timestamp predictions or time series reconstructions. While traditionally considered separately, these methods are not mutually exclusive and can offer complementary perspectives on anomaly detection. This paper first highlights the successes and limitations of prediction-based and reconstruction-based methods with visualized time series signals and anomaly scores. We then propose AER (Auto-encoder with Regression), a joint model that combines a vanilla auto-encoder and an LSTM regressor to incorporate the successes and address the limitations of each method. Our model can produce bi-directional predictions while simultaneously reconstructing the original time series by optimizing a joint objective function. Furthermore, we propose several ways of combining the prediction and reconstruction errors through a series of ablation studies. Finally, we compare the performance of the AER architecture against two prediction-based methods and three reconstruction-based methods on 12 well-known univariate time series datasets from NASA, Yahoo, Numenta, and UCR. The results show that AER has the highest averaged F1 score across all datasets (a 23.5% improvement compared to ARIMA) while retaining a runtime similar to its vanilla auto-encoder and regressor components. Our model is available in Orion, an open-source benchmarking tool for time series anomaly detection.
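To make the joint-objective idea concrete, the PyTorch sketch below pairs a small auto-encoder with an LSTM regressor and trains both under a combined reconstruction-plus-prediction loss. It is a simplified illustration (single feature, one-step-ahead prediction, equal loss weights), not the authors' AER implementation, which is available in Orion.

```python
import torch
import torch.nn as nn

class AutoencoderRegressor(nn.Module):
    """Toy joint model: reconstruct a window and predict the next value."""
    def __init__(self, window: int = 50, hidden: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(window, hidden), nn.ReLU())
        self.decoder = nn.Linear(hidden, window)       # reconstruction head
        self.regressor = nn.LSTM(1, hidden, batch_first=True)
        self.pred_head = nn.Linear(hidden, 1)          # next-step prediction head

    def forward(self, x):                              # x: (batch, window)
        z = self.encoder(x)
        recon = self.decoder(z)
        lstm_out, _ = self.regressor(x.unsqueeze(-1))  # (batch, window, hidden)
        pred = self.pred_head(lstm_out[:, -1, :])      # (batch, 1)
        return recon, pred

model = AutoencoderRegressor()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(8, 50)       # toy windows of a univariate series
y_next = torch.randn(8, 1)   # toy next values

recon, pred = model(x)
loss = nn.functional.mse_loss(recon, x) + nn.functional.mse_loss(pred, y_next)
loss.backward()
opt.step()
# At detection time, the reconstruction and prediction errors would be combined
# into a single anomaly score, e.g. by averaging their normalized values.
```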
Current language models are considered to have sub-human capabilities at natural language tasks like question-answering or writing code. However, language models are not trained to perform well at these tasks; they are trained to accurately predict the next token given the previous tokens in tokenized text. It is not clear whether language models are better or worse than humans at next-token prediction. To try to answer this question, we performed two distinct experiments to directly compare humans and language models on this front: one measuring top-1 accuracy and the other measuring perplexity. In both experiments, we find humans to be consistently worse than even relatively small language models like GPT3-Ada at next-token prediction.
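The sketch below shows how the two quantities being compared are typically computed for a language model: top-1 next-token accuracy and perplexity over a text. It uses GPT-2 via Hugging Face transformers as a freely available stand-in; the paper's actual protocol, models, and human-facing interface are not reproduced here.

```python
# Minimal sketch: top-1 next-token accuracy and perplexity of a causal LM.
# GPT-2 stands in for the models used in the paper.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

text = "The quick brown fox jumps over the lazy"
ids = tok(text, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(ids).logits                       # (1, seq_len, vocab)

# Top-1 accuracy: how often the model's most probable token equals the actual
# next token (a human would instead type their guess at each position).
preds = logits[0, :-1].argmax(dim=-1)
targets = ids[0, 1:]
top1_acc = (preds == targets).float().mean().item()

# Perplexity: exponential of the average negative log-likelihood assigned to
# the actual next tokens.
log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
nll = -log_probs.gather(1, targets.unsqueeze(1)).mean()
perplexity = torch.exp(nll).item()

print(f"top-1 accuracy: {top1_acc:.2f}, perplexity: {perplexity:.1f}")
```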
Many dialogue systems (DSs) lack characteristics humans have, such as emotion perception, factuality, and informativeness. Enhancing DSs with knowledge alleviates this problem, but, as many ways of doing so exist, keeping track of all proposed methods is difficult. Here, we present the first survey of knowledge-enhanced DSs. We define three categories of systems - internal, external, and hybrid - based on the knowledge they use. We survey the motivation for enhancing DSs with knowledge, used datasets, and methods for knowledge search, knowledge encoding, and knowledge incorporation. Finally, we propose how to improve existing systems based on theories from linguistics and cognitive science.
Recurrent neural networks are a widely used class of neural architectures. They have, however, two shortcomings. First, they are often treated as black-box models, and as such it is difficult to understand what exactly they learn as well as how they arrive at a particular prediction. Second, they tend to work poorly on sequences requiring long-term memorization, despite having this capacity in principle. We aim to address both shortcomings with a class of recurrent networks that use a stochastic state transition mechanism between cell applications. This mechanism, which we term state-regularization, makes RNNs transition between a finite set of learnable states. We evaluate state-regularized RNNs on (1) regular languages for the purpose of automata extraction; (2) non-regular languages such as balanced parentheses and palindromes, where external memory is required; and (3) real-world sequence learning tasks for sentiment analysis, visual object recognition, and text categorization. We show that state-regularization (a) simplifies the extraction of finite state automata that display an RNN's state transition dynamics; (b) forces RNNs to operate more like automata with external memory and less like finite state machines, which potentially leads to a more structured memory; (c) leads to better interpretability and explainability of RNNs by leveraging the probabilistic finite state transition mechanism over time steps.
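One simple way to read the stochastic state-transition idea is that, after each cell update, the hidden state is replaced by a (soft or sampled) mixture over a small set of learnable centroid states. The PyTorch sketch below implements the soft variant under that reading; it is an illustrative interpretation, not the paper's exact mechanism, and the GRU cell, state count, and dimensions are arbitrary choices.

```python
import torch
import torch.nn as nn

class StateRegularizedGRUCell(nn.Module):
    """GRU cell whose output is softly snapped to k learnable centroid states."""
    def __init__(self, input_size: int, hidden_size: int, num_states: int = 10):
        super().__init__()
        self.cell = nn.GRUCell(input_size, hidden_size)
        self.centroids = nn.Parameter(torch.randn(num_states, hidden_size))

    def forward(self, x, h):
        u = self.cell(x, h)                    # ordinary GRU update
        scores = u @ self.centroids.t()        # similarity to each centroid state
        alpha = torch.softmax(scores, dim=-1)  # soft (probabilistic) state assignment
        return alpha @ self.centroids          # convex combination of learnable states

cell = StateRegularizedGRUCell(input_size=8, hidden_size=16, num_states=5)
h = torch.zeros(4, 16)
for t in range(20):                            # toy sequence of length 20
    x_t = torch.randn(4, 8)
    h = cell(x_t, h)
print(h.shape)  # torch.Size([4, 16])
```

Because every hidden state is a mixture over the same few centroids, the assignment weights can be read as transition probabilities over a finite state set, which is what makes automata extraction straightforward.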
Survival analysis is the branch of statistics that studies the relation between the characteristics of living entities and their respective survival times, taking into account the partial information held by censored cases. A good analysis can, for example, determine whether one medical treatment for a group of patients is better than another. With the rise of machine learning, survival analysis can be modeled as learning a function that maps studied patients to their survival times. To succeed with that, there are three crucial issues to be tackled. First, some patient data is censored: we do not know the true survival times for all patients. Second, data is scarce, which led past research to treat different illness types as domains in a multi-task setup. Third, there is the need for adaptation to new or extremely rare illness types, where little or no labels are available. In contrast to previous multi-task setups, we want to investigate how to efficiently adapt to a new survival target domain from multiple survival source domains. For this, we introduce a new survival metric and the corresponding discrepancy measure between survival distributions. These allow us to define domain adaptation for survival analysis while incorporating censored data, which would otherwise have to be dropped. Our experiments on two cancer data sets reveal a superb performance on target domains, a better treatment recommendation, and a weight matrix with a plausible explanation.
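Since censoring is the first of the three issues raised, here is a short Python sketch of the standard representation of censored survival data (observed time plus an event indicator) and a Kaplan-Meier estimate of the survival curve. This is textbook background on handling censoring, not the survival metric or discrepancy measure introduced by the paper; the data are toy values.

```python
# Standard censored-data representation and a Kaplan-Meier survival estimate.
# Background illustration only; not the paper's proposed metric.
import numpy as np

times = np.array([5.0, 8.0, 8.0, 12.0, 15.0, 21.0])  # observed follow-up times
events = np.array([1, 1, 0, 1, 0, 1])                # 1 = event observed, 0 = censored

def kaplan_meier(times, events):
    """Return (event time, estimated survival probability just after it) pairs."""
    order = np.argsort(times)
    times, events = times[order], events[order]
    surv, curve = 1.0, []
    for t in np.unique(times[events == 1]):
        at_risk = np.sum(times >= t)                  # still under observation at t
        deaths = np.sum((times == t) & (events == 1))
        surv *= 1.0 - deaths / at_risk
        curve.append((float(t), float(surv)))
    return curve

print(kaplan_meier(times, events))
```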
Computational catalysis is playing an increasingly significant role in the design of catalysts across a wide range of applications. A common task for many computational methods is to accurately compute the minimum binding energy - the adsorption energy - of an adsorbate on a catalyst surface of interest. Traditionally, the identification of low-energy adsorbate-surface configurations relies on heuristic methods and researcher intuition. As the desire to perform high-throughput screening increases, it becomes challenging to use heuristics and intuition alone. In this paper, we demonstrate that machine learning potentials can be leveraged to identify low-energy adsorbate-surface configurations more accurately and efficiently. Our algorithm provides a spectrum of trade-offs between accuracy and efficiency, with one balanced option finding the lowest-energy configuration, within a 0.1 eV threshold, 86.63% of the time, while achieving a 1387x speedup in computation. To standardize benchmarking, we introduce the Open Catalyst Dense dataset containing nearly 1,000 diverse surfaces and 87,045 unique configurations.
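The general workflow being accelerated - relax many candidate adsorbate placements and keep the minimum-energy one - can be sketched in a few lines with ASE. In the sketch below, the EMT calculator is a cheap stand-in for a machine learning potential, and the Cu(100) surface with an O adsorbate is a toy example; it is not the paper's algorithm or the Open Catalyst setup.

```python
# Relax several candidate adsorbate placements and keep the lowest-energy one.
# EMT is a cheap stand-in for a machine learning potential; the system is a toy.
from ase.build import fcc100, add_adsorbate
from ase.calculators.emt import EMT
from ase.optimize import BFGS

candidates = []
for site in ["ontop", "bridge", "hollow"]:        # heuristic starting sites
    slab = fcc100("Cu", size=(2, 2, 3), vacuum=8.0)
    add_adsorbate(slab, "O", height=1.8, position=site)
    slab.calc = EMT()
    BFGS(slab, logfile=None).run(fmax=0.05)       # relax the configuration
    candidates.append((slab.get_potential_energy(), site))

energy, best_site = min(candidates)
print(f"lowest-energy site: {best_site} ({energy:.3f} eV)")
```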
Neural networks can be trained to solve regression problems by using gradient-based methods to minimize the square loss. However, practitioners often prefer to reformulate regression as a classification problem, observing that training on the cross entropy loss results in better performance. By focusing on two-layer ReLU networks, which can be fully characterized by measures over their feature space, we explore how the implicit bias induced by gradient-based optimization could partly explain the above phenomenon. We provide theoretical evidence that the regression formulation yields a measure whose support can differ greatly from that for classification, in the case of one-dimensional data. Our proposed optimal supports correspond directly to the features learned by the input layer of the network. The different nature of these supports sheds light on possible optimization difficulties the square loss could encounter during training, and we present empirical results illustrating this phenomenon.
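The reformulation practitioners use can be made concrete with a small PyTorch sketch: the same two-layer ReLU network is trained either on the square loss directly, or on the cross-entropy loss after binning the one-dimensional targets into classes. This illustrates the practice discussed above, not the paper's theoretical analysis; the data, bin count, and training settings are arbitrary.

```python
# Regression vs. its classification reformulation on a two-layer ReLU network.
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.linspace(-1, 1, 256).unsqueeze(1)   # one-dimensional inputs
y = torch.sin(3 * x).squeeze(1)               # toy regression target

def two_layer_relu(out_dim: int) -> nn.Module:
    return nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, out_dim))

# Regression: scalar output, square loss.
reg_net = two_layer_relu(1)
reg_opt = torch.optim.SGD(reg_net.parameters(), lr=0.1)
for _ in range(500):
    reg_opt.zero_grad()
    nn.functional.mse_loss(reg_net(x).squeeze(1), y).backward()
    reg_opt.step()

# Classification: bin the target into k classes, train with cross entropy.
k = 20
bins = torch.linspace(y.min(), y.max(), k + 1)
labels = torch.clamp(torch.bucketize(y, bins) - 1, 0, k - 1)
cls_net = two_layer_relu(k)
cls_opt = torch.optim.SGD(cls_net.parameters(), lr=0.1)
for _ in range(500):
    cls_opt.zero_grad()
    nn.functional.cross_entropy(cls_net(x), labels).backward()
    cls_opt.step()

# A predicted value can be recovered from the classifier by taking the
# midpoint of the most probable bin.
centers = (bins[:-1] + bins[1:]) / 2
y_hat = centers[cls_net(x).argmax(dim=1)]
print(nn.functional.mse_loss(y_hat, y).item())
```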